Nonstochastic bandits: Countable decision set, unbounded costs and reactive environments

Author

Jan Poland

Abstract

The nonstochastic multi-armed bandit problem, first studied by Auer, Cesa-Bianchi, Freund, and Schapire in 1995, is a game of repeatedly choosing one decision from a set of decisions (“experts”) under partial observation: in each round $t$, only the cost of the decision played is observable. A regret minimization algorithm plays this game while achieving sublinear regret relative to each decision. It is known that an adversary controlling the costs of the decisions can force on the player a regret growing as $t^{1/2}$ in the time $t$. In this work, we propose the first algorithm for a countably infinite set of decisions that achieves a regret upper bounded by $O(t^{1/2+\varepsilon})$, i.e. arbitrarily close to optimal order. To this aim, we build on the “follow the perturbed leader” principle, which dates back to work by Hannan in 1957. Our results hold against an adaptive adversary, for both the expected and high probability regret of the learner w.r.t. each decision. In the second part of the paper, we consider reactive problem settings, that is, situations where the learner’s decisions affect the future behaviour of the adversary, and a strong strategy can draw benefit from well chosen past actions. We present a variant of our regret minimization algorithm which still has regret of order at most $t^{1/2+\varepsilon}$ relative to such strong strategies, and even sublinear regret not exceeding $O(t^{4/5})$ w.r.t. the hypothetical (without external interference) performance of a strong strategy. We show how to combine the regret minimizer with a universal class of experts, given by the countable set of programs on some fixed universal Turing machine. This defines a universal learner with sublinear regret relative to any computable strategy. © 2008 Elsevier B.V. All rights reserved.
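The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general "follow the perturbed leader" idea in a bandit setting, not the paper's method. It assumes a finite number of arms, costs in [0, 1], a fixed learning rate `eta` and exploration rate `gamma`, and a Monte Carlo estimate of the play probability; none of these choices come from the source, and all function and parameter names are hypothetical. The countable decision set, unbounded costs, and reactive setting treated in the paper are not reproduced here.

```python
import random


def fpl_leader(est_costs, eta, rng):
    # Follow the perturbed leader: subtract an independent exponential
    # perturbation (scaled by 1/eta) from each estimated cumulative cost
    # and pick the arm with the smallest perturbed value.
    return min(range(len(est_costs)),
               key=lambda i: est_costs[i] - rng.expovariate(1.0) / eta)


def play_probability(arm, est_costs, eta, gamma, rng, samples=200):
    # Assumption: the probability that FPL selects `arm` is approximated by
    # resampling the perturbations, then mixed with uniform exploration.
    hits = sum(fpl_leader(est_costs, eta, rng) == arm for _ in range(samples))
    return (1 - gamma) * hits / samples + gamma / len(est_costs)


def run_bandit_fpl(cost_fn, k, rounds, eta=0.1, gamma=0.1, seed=0):
    rng = random.Random(seed)
    est_costs = [0.0] * k  # importance-weighted cumulative cost estimates
    total = 0.0
    for t in range(rounds):
        # With probability gamma explore uniformly, otherwise follow FPL.
        if rng.random() < gamma:
            arm = rng.randrange(k)
        else:
            arm = fpl_leader(est_costs, eta, rng)
        cost = cost_fn(t, arm)  # only the played arm's cost is observed
        total += cost
        # Dividing by the (estimated) play probability keeps the estimate
        # roughly unbiased; p >= gamma / k, so the division is safe.
        p = play_probability(arm, est_costs, eta, gamma, rng)
        est_costs[arm] += cost / p
    return total, est_costs
```

For example, `run_bandit_fpl(lambda t, a: 0.1 if a == 0 else 0.9, k=3, rounds=500)` plays against a fixed cost assignment in which arm 0 is cheapest. The uniform exploration term is what keeps every play probability bounded away from zero, which in turn bounds the size of each importance-weighted update.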

Similar resources

Following the Perturbed Leader to Gamble at Multi-armed Bandits

Following the perturbed leader (FPL) is a powerful technique for solving online decision problems. Kalai and Vempala [1] rediscovered this algorithm recently. A traditional model for online decision problems is the multi-armed bandit. In it, a gambler has to choose at each round one of the k levers to pull with the intention to minimize the cumulated cost. There are four versions of the nonstoch...

Delay and Cooperation in Nonstochastic Bandits

We study networks of communicating learning agents that cooperate to solve a common nonstochastic bandit problem. Agents use an underlying communication network to get messages about actions selected by other agents, and drop messages that took more than d hops to arrive, where d is a delay parameter. We introduce EXP3-COOP, a cooperative version of the EXP3 algorithm and prove that with K acti...

Bandit Regret Scaling with the Effective Loss Range

We study how the regret guarantees of nonstochastic multi-armed bandits can be improved, if the effective range of the losses in each round is small (e.g. the maximal difference between two losses in a given round). Despite a recent impossibility result, we show how this can be made possible under certain mild additional assumptions, such as availability of rough estimates of the losses, or adv...

Nonstochastic Multi-Armed Bandits with Graph-Structured Feedback

We present and study a partial-information model of online learning, where a decision maker repeatedly chooses from a finite set of actions, and observes some subset of the associated losses. This naturally models several situations where the losses of different actions are related, and knowing the loss of one action provides information on the loss of other actions. Moreover, it...

An Analysis of Transient Markov Decision Processes

This paper is concerned with the analysis of Markov decision processes in which a natural form of termination ensures that the expected future costs are bounded, at least under some policies. Whereas most previous analyses have restricted attention to the case where the set of states is finite, this paper analyses the case where the set of states is not necessarily finite or even countable. It i...

Journal:

Theor. Comput. Sci.

Volume 397, Issue

Pages -

Publication date: 2008
